There are several required columns in the GWAS data input for ezQTL. These columns include: chromosome (chr), position (pos), reference allele (ref), alternative allele (alt), rsnum, pvalue, zscore (with direction being with regard to the alt allele), effect, and standard error of effect (se). Below is an example table including GWAS summary data from NHGRI-EBI GWAS Catalog, which have been formatted to ezQTL input according to this instruction:
## trait chr pos ref alt rsnum pvalue
## 1: GWAS_GCST004415-ovarian-carcinoma 10 60523 T G rs112920234 0.81970
## 2: GWAS_GCST004415-ovarian-carcinoma 10 60684 A C rs569167217 0.17910
## 3: GWAS_GCST004415-ovarian-carcinoma 10 60969 C A rs61838556 0.09314
## 4: GWAS_GCST004415-ovarian-carcinoma 10 61283 G A rs532801013 0.96930
## 5: GWAS_GCST004415-ovarian-carcinoma 10 61331 A G rs548639866 0.17970
## 6: GWAS_GCST004415-ovarian-carcinoma 10 61334 G A rs183305313 0.87380
## 7: GWAS_GCST004415-ovarian-carcinoma 10 61372 CA C rs147855157 0.48080
## 8: GWAS_GCST004415-ovarian-carcinoma 10 61419 G A rs553163044 0.80290
## 9: GWAS_GCST004415-ovarian-carcinoma 10 62450 G A rs539912981 0.76890
## 10: GWAS_GCST004415-ovarian-carcinoma 10 63213 G C rs542543788 0.17910
## zscore effect se
## 1: -0.2279310 -0.101400 0.44470
## 2: -1.3435300 -0.058330 0.04341
## 3: -1.6790600 -0.029280 0.01744
## 4: 0.0384862 0.008872 0.23040
## 5: -1.3416800 -0.058250 0.04341
## 6: 0.1588340 0.019670 0.12380
## 7: -0.7050160 -0.012000 0.01702
## 8: 0.2495960 0.010300 0.04127
## 9: -0.2938140 -0.076250 0.25950
## 10: -1.3435300 -0.058330 0.04341
Several GWAS datasets included in ezQTL are from the NHGRI-EBI GWAS catalog. Included here is an example GWAS formatting workflow to take a data file from the NHGRI-EBI GWAS catalog and format it to be an appropriate and usable input for ezQTL. This workflow includes two steps. The first step is an R script, which takes the original GWAS data file and formats the data to include the required columns for ezQTL that were mentioned above. The second step is a bash script, which takes the output file from step one and matches the reference and alternative alleles to the reference and alternative alleles in the dbSNP database. This step allows for the correct direction of the zscore (and effect) to be determined so that this direction is with regard to the alternative allele. Below is a more detailed explanation of each of these two steps, as well as the example scripts.
Above, we can see that both the beta column and standard_error column are “NA”, causing the need for calculations to determine the correct effect, standard error, and zscore. The steps from the R script,Format_GWAS_single_OR.R, are explained below. There is one input argument for running the script, which is the name of the file that you wish to format.
library(tidyverse)
chr<- c('chromosome','Chromosome','CHR','#CHR','#CHROM','Chr')
pos <- c('Pos','base_pair_location','Position','POS','pos')
ref <- c('other_allele','REF','Allele1','NonEffectAllele','Baseline','NEA')
alt <- c('effect_allele','EffectAllele','Effect','EA','ALT','Allele2')
rsnum <- c('ID','rsnum','variant_id','#MARKER','OrigSNPname','SNP')
pvalue <- c('p_value','overall_pvalue','P','PVALUE','pval','P-value')
effect <- c('beta','BETA','EFFECT1','Effect')
se <- c('standard_error','overall_SE','SE','STDERR','sebeta')
odds_ratio <- c('odds_ratio','overall_OR','OR')
gwasfile variable to the first argument.gwasfile <- args[1]
trait_file variable (will become first column in output table) and short_file_name variable (will be used to name the output file).trait_file <- str_remove(str_remove(gwasfile,'.*/GWAS_'),'_[^_]*_[0-9]*.tsv.gz$')
short_file_name <- str_remove(gwasfile,'.*/')
gwas_data <- read_delim(gwasfile,delim = '\t',col_names = T,col_types = cols('chromosome'='c'))
colnames(gwas_data) [colnames(gwas_data) %in% chr] <- 'chr'
colnames(gwas_data) [colnames(gwas_data) %in% pos] <- 'pos'
colnames(gwas_data) [colnames(gwas_data) %in% ref] <- 'ref'
colnames(gwas_data) [colnames(gwas_data) %in% alt] <- 'alt'
colnames(gwas_data) [colnames(gwas_data) %in% rsnum] <- 'rsnum'
colnames(gwas_data) [colnames(gwas_data) %in% pvalue] <- 'pvalue'
colnames(gwas_data) [colnames(gwas_data) %in% effect] <- 'effect'
colnames(gwas_data) [colnames(gwas_data) %in% se] <- 'se'
colnames(gwas_data) [colnames(gwas_data) %in% odds_ratio] <- 'odds_ratio'
trait, and select for the required columns that have now been renamed: trait, chr, pos, ref, alt, rsnum, pvalue, effect, se, odds_ratio.gwas_data <- gwas_data %>% mutate(trait = trait_file) %>% select(trait,chr,pos,ref,alt,rsnum,pvalue,effect,se,odds_ratio)
The code to determine the effect is as follows:
if(sum(is.na(gwas_data$effect)) > 0){
if((sum(gwas_data$odds_ratio <0)) >= 1){
gwas_data$effect <- gwas_data$odds_ratio}
else{
gwas_data$effect <- log(gwas_data$odds_ratio)}
}else{
gwas_data$effect <- gwas_data$effect}
First, the if statement checks to see if any of the current values pulled for the effect column are “NA”. If there are not any “NAs” in the effect column, then the effect column remains. However, if there is one “NA” value or more, the code moves into the nested if statement, which looks at the values in the odds_ratio column. If any values in the odds_ratio column are negative, it is assumed that this is the effect column. This is because the odds_ratio cannot be negative, and so if values in this column are negative, the log(odds_ratio) was already calculated, and this is the value found here. The effect is equivalent to the log(odds_ratio).
The code to calculate the zscore is as follows:
gwas_data <- gwas_data %>% mutate(zscore=if_else(effect<0,-1*abs(qnorm(pvalue/2)),abs(qnorm(pvalue/2)))) %>% select(trait,chr,pos,ref,alt,rsnum,pvalue,zscore,effect,se)
To calculate the zscore correctly, it must be determined what the sign of the log(odds_ratio) (i.e. effect) is. If the effect is less than 0, the sign(logOR) will be -1. In this case, we expect the zscore to be less than 0. In contrast, if the effect is greater than 0, the sign(logOR) will be +1. Here, we expect the zscore to be greater than 0. After determining if an effect is greater than or less than 0, this sign (-1/+1) is multiplied by the abs(qnorm(pvalue/2)) to find the zscore.
The code to calculate se is as follows:
if(sum(is.na(gwas_data$se)) > 0){
gwas_data$se <-gwas_data$effect/gwas_data$zscore
}else{
gwas_data$se <- gwas_data$se}
As with the effect determination, to find se, again the number of “NAs” are tabulated for the se column. If there is one “NA” or more, then the se is calculated by dividing the effect by the zscore. (This is just a manipulated version of the common zscore formula, zscore=effect/se). If there are no “NAs” in the previously selected se column, this se column remains. Otherwise, the se column remains as is from the original file.
gwas_data <- gwas_data %>% select(trait,chr,pos,ref,alt,rsnum,pvalue,zscore,effect,se)
gwas_data %>% write_delim(paste0('../formated/GWAS_NHGRI_EBI/GWAS_NHGRI_EBI_',short_file_name),delim = '\t',col_names = T)
Below is the entire Format_GWAS_single_noOR.R script:
# 1: required R package
library(tidyverse)
args = commandArgs(trailingOnly=TRUE)
# 2: possible names for each column for each of the data files in the NHGRI-EBI GWAS Catalog
chr<- c('chromosome','Chromosome','CHR','#CHR','#CHROM','Chr')
pos <- c('Pos','base_pair_location','Position','POS','pos')
ref <- c('other_allele','REF','Allele1','NonEffectAllele','Baseline','NEA')
alt <- c('effect_allele','EffectAllele','Effect','EA','ALT','Allele2')
rsnum <- c('ID','rsnum','variant_id','#MARKER','OrigSNPname','SNP')
pvalue <- c('p_value','overall_pvalue','P','PVALUE','pval','P-value')
effect <- c('beta','BETA','EFFECT1','Effect')
se <- c('standard_error','overall_SE','SE','STDERR','sebeta')
odds_ratio <- c('odds_ratio','overall_OR','OR','todds_ratio')
# 3: set gwasfile variable to the first argument
gwasfile <- args[1]
# 4: trait_file variable (will become first column in output table) and short_file_name (will be used to name the output file)
trait_file <- str_remove(str_remove(gwasfile,'.*/GWAS_'),'_[^_]*_[0-9]*.tsv.gz$')
short_file_name <- str_remove(gwasfile,'.*/')
# 5: read in the gwasfile that you wish to format
gwas_data <- read_delim(gwasfile,delim = '\t',col_names = T,col_types = cols('chromosome'='c'))
# 6: Name each of the columns what it is required to be for GWAS input data in ezQTL
colnames(gwas_data) [colnames(gwas_data) %in% chr] <- 'chr'
colnames(gwas_data) [colnames(gwas_data) %in% pos] <- 'pos'
colnames(gwas_data) [colnames(gwas_data) %in% ref] <- 'ref'
colnames(gwas_data) [colnames(gwas_data) %in% alt] <- 'alt'
colnames(gwas_data) [colnames(gwas_data) %in% rsnum] <- 'rsnum'
colnames(gwas_data) [colnames(gwas_data) %in% pvalue] <- 'pvalue'
colnames(gwas_data) [colnames(gwas_data) %in% effect] <- 'effect'
colnames(gwas_data) [colnames(gwas_data) %in% se] <- 'se'
colnames(gwas_data) [colnames(gwas_data) %in% odds_ratio] <- 'odds_ratio'
# prints the trait_file variable so that you can keep track of where the script is in processing (example in the logs output and error files)
print(trait_file)
# 7: Add the trait_file variable to the table as the first column, ‘trait’, and select for the required columns that have now been renamed: trait, chr, pos, ref, alt, rsnum, pvalue, effect, se, odds_ratio
gwas_data <- gwas_data %>% mutate(trait = trait_file) %>% select(trait,chr,pos,ref,alt,rsnum,pvalue,effect,se,odds_ratio)
# 8: determination of and calculations for effect, zscore, and se
# effect
if(sum(is.na(gwas_data$effect)) > 0){
if((sum(gwas_data$odds_ratio <0)) >= 1){
gwas_data$effect <- gwas_data$odds_ratio}
else{
gwas_data$effect <- log(gwas_data$odds_ratio)}
}else{
gwas_data$effect <- gwas_data$effect}
#print(gwas_data$effect)
# zscore
gwas_data <- gwas_data %>% mutate(zscore=if_else(effect<0,-1*abs(qnorm(pvalue/2)),abs(qnorm(pvalue/2)))) %>% select(trait,chr,pos,ref,alt,rsnum,pvalue,zscore,effect,se)
# se
if(sum(is.na(gwas_data$se)) > 0){
gwas_data$se <-gwas_data$effect/gwas_data$zscore
}else{
gwas_data$se <- gwas_data$se}
# 9: select the columns needed for GWAS data in ezQTL
gwas_data <- gwas_data %>% select(trait,chr,pos,ref,alt,rsnum,pvalue,zscore,effect,se)
# 10: Write to a file in a formatted directory
gwas_data %>% write_delim(paste0('../formated/GWAS_NHGRI_EBI/GWAS_NHGRI_EBI_',short_file_name),delim = '\t',col_names = T)
Below is a screenshot of the entire script:
However, other GWAS datasets from the NHGRI-EBI catalog do not include an ‘odds_ratio’ column, and instead just have a complete ‘effect’ or ‘beta’ column already present. Therefore, this requires slightly less calculations than the first script, but is still very similar.
In the script Format_GWAS_single_noOR.R, steps 1-7, and 10 are the same as they were described in the first scenario. The only step that differs is step 8. Since the files without the odds_ration column have a beta (effect) column, and a standard error column, calculations only need to be done to find the zscore.
gwas_data <- gwas_data %>% mutate(zscore=effect/se) %>% select(trait,chr,pos,ref,alt,rsnum,pvalue,zscore,effect,se)
The zscore is calculated by dividing the effect by se. After calculating the zscore, select for the columns needed to run GWAS data through ezQTL. (This is combined with step 9 in the first scenario). Then the result is written to a file.
Below is the entire Format_GWAS_single_noOR.R script:
# 1: required R package
library(tidyverse)
args = commandArgs(trailingOnly=TRUE)
# 2: possible names for each column for each of the data files in the NHGRI-EBI GWAS Catalog
chr<- c('chromosome','Chromosome','CHR','#CHR','#CHROM','Chr')
pos <- c('Pos','base_pair_location','Position','POS','pos')
ref <- c('other_allele','REF','Allele1','NonEffectAllele','Baseline','NEA')
alt <- c('effect_allele','EffectAllele','EA','ALT','Allele2')
rsnum <- c('ID','rsnum','variant_id','#MARKER','OrigSNPname','SNP')
pvalue <- c('p_value','overall_pvalue','P','PVALUE','pval','P-value')
effect <- c('beta','BETA','EFFECT1','Effect','overall_OR')
se <- c('standard_error','overall_SE','SE','STDERR','sebeta')
# 3: set gwasfile variable to the first argument
gwasfile <- args[1]
# 4: trait_file variable (will become first column in output table) and short_file_name (will be used to name the output file)
trait_file <- str_remove(str_remove(gwasfile,'.*/GWAS_'),'_[^_]*_[0-9]*.tsv.gz$')
short_file_name <- str_remove(gwasfile,'.*/')
# 5: read in the gwasfile that you wish to format
gwas_data <- read_delim(gwasfile,delim = '\t',col_names = T)
# 6: Name each of the columns what it is required to be for GWAS input data in ezQTL
colnames(gwas_data) [colnames(gwas_data) %in% chr] <- 'chr'
colnames(gwas_data) [colnames(gwas_data) %in% pos] <- 'pos'
colnames(gwas_data) [colnames(gwas_data) %in% ref] <- 'ref'
colnames(gwas_data) [colnames(gwas_data) %in% alt] <- 'alt'
colnames(gwas_data) [colnames(gwas_data) %in% rsnum] <- 'rsnum'
colnames(gwas_data) [colnames(gwas_data) %in% pvalue] <- 'pvalue'
colnames(gwas_data) [colnames(gwas_data) %in% effect] <- 'effect'
colnames(gwas_data) [colnames(gwas_data) %in% se] <- 'se'
# prints the trait_file variable so that you can keep track of where the script is in processing (example in the logs output and error files)
print(trait_file)
# 7: Add the trait_file variable to the table as the first column, ‘trait’, and select for the required columns that have now been renamed: trait, chr, pos, ref, alt, rsnum, pvalue, effect, se
gwas_data <- gwas_data %>% mutate(trait = trait_file) %>% select(trait,chr,pos,ref,alt,rsnum,pvalue,effect,se)
# 8: zscore calculation and select columns needed to run GWAS data through ezQTL
gwas_data <- gwas_data %>% mutate(zscore=effect/se) %>% select(trait,chr,pos,ref,alt,rsnum,pvalue,zscore,effect,se)
# 9: Write to a file in a formatted directory
gwas_data %>% write_delim(paste0('../formated/test/GWAS_NHGRI_EBI_',short_file_name),delim = '\t',col_names = T)
Below is a screenshot of the entire Format_GWAS_single_noOR.R script:
The second step of the GWAS formatting process is running a bash script to match the reference and alternative alleles in the dataset to those considered the reference and alternative alleles in dbSNP. This is to determine the correction direction of the zscore (and effect) so that it is in reference to the alternative allele (in dbSNP).
The two input arguments in the bash script, ezQTL_match.sh, are the input file, and the desired output file name. In the script, a series of awk statements read the dbSNP file (which contains known rsIDs, chromosome, position, ref, and alt alleles), and generates an array and key. This is followed by two keys being established for the input file. If the reference allele in the input file matches the reference allele in the dbSNP file, the direction of the zscore and effect remains unchanged. However, if the reference allele in the input file is the alternative allele in the dbSNP file, then the direction of the zscore and effect are changed. Below is the ezQTL_match.sh script:
#/usr/bin/sh
File=$1
Out=$2
zcat $File| awk -F "\t" -v OFS="\t" -v out=$Out ' \
NR==FNR{split($5,x,","); for(i in x) { $5=x[i]; key=$2":"$3":"$4":"x[i]; a[key]=$2"\t"$3"\t"$4"\t"x[i]"\t"$1;}; next;} \
FNR==1{print "#trait\tchr\tpos\tref\talt\trsnum\tpvalue\tzscore\teffect\tse"; next;} \
{key1=$2":"$3":"$4":"$5; key2=$2":"$3":"$5":"$4; dir=1; if(key1 in a) {value=a[key1]; dir=1; n1++;} else {if(key2 in a) {value=a[key2]; dir=-1; n2++;} else {value=$2"\t"$3"\t"$4"\t"$5"\t"$6;n3++;}}; \
$8=$8*dir; $9=$9*dir; print $1,value,$7,$8,$9,$10; \
}END{print n1,n2,n3 >out"_match_info.txt"}' \
/data/ITEB-BRCA/ezQTL/QTL_data/00-All.vcf.id - |bgzip >$Out
wget -c ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh37p13/VCF/00-All.vcf.gz*
bcftools query -f '%ID\t%CHROM\t%POS\t%REF\t%ALT\n' 00-All.vcf.gz >00-All.vcf.id
removerows_sort_index.sh to do so. This simple script removes rows where the pvalue is less than 0, sorts the file by chromosome and position, and indexes the file. Note the different directories the files generated are placed into as the script is run. The final data and index files will be in the sorted/ directory. Below is the script:#/usr/bin/sh
Out=$1
# remove rows with pval not greater than 0
zcat $Out | awk '{if($7>0) print $0}'| bgzip -c > refined/$Out
cd refined/
# sort
zcat $Out |sort -k2,2n -k3,3n | bgzip -c > sorted/$Out #sort file
cd sorted/
# generate index file
tabix -c "#" -s 2 -b 3 -e 3 $Out